Abstract: This paper deals with the two-sample Kolmogorov-Smirnov test and its biasedness. In general, this test is not unbiased when the sample sizes differ. We find the most biased distribution for some values of the significance level $\alpha$. Moreover, we show that there exist numbers of observations and a significance level $\alpha$ such that the test is unbiased at level $\alpha$.
1 Introduction
In the world of statistics, there exists an enormous number of tests, and new ones continue to be derived. For most of these tests we know that they are consistent, we know their asymptotic behavior, and we know many other properties. But one property is often omitted: unbiasedness.

One might think that all tests in use are unbiased, or are biased only against very special alternatives that cannot occur in practical applications. One might regard biasedness merely as very poor power against some alternatives, or simply consider unbiasedness unimportant. All of these views are wrong. We often check the assumptions of a test by other tests. But what if the checking test is biased and therefore leads to a bad decision? Then the main test, which should not have been used, is applied anyway and can lead to a wrong decision. Therefore, unbiasedness should not be underestimated.

Many tests are indeed unbiased. But plenty of tests that are used daily are biased. One such test is the well-known two-sample Kolmogorov-Smirnov test. In what follows, we look in detail at the biasedness and unbiasedness of this test in some cases.
Firstly, we should recall what unbiasedness is. A test is said to be unbiased at level $\alpha$ if

1. it has significance level $\alpha$,

2. for all distributions from the alternative, the power of the test is greater than or equal to $\alpha$.

The test is said to be unbiased if it is unbiased at all levels $\alpha \in (0, 1)$. Finally, the test is said to be biased if it is not unbiased. In particular, the test is biased at level $\alpha$ against the alternative $G$ if it is a level-$\alpha$ test and $P(\text{reject } H \mid G) < \alpha$.

Consider two independent samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ having distributions with continuous distribution functions $F$ and $G$, respectively. We would like to test the hypothesis $H: F = G$ against the alternative $A: F \neq G$. The two-sample Kolmogorov-Smirnov test is based on the statistic

$$D_{n,m} = \sup_x |F_n(x) - G_m(x)|,$$

where $F_n(x)$ and $G_m(x)$ are the empirical distribution functions of the two samples. The hypothesis $H$ is rejected for large values of $D_{n,m}$. The exact formula for computing p-values can be found in Hájek et al. (1999).
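As an illustration of the definition above, the statistic can be evaluated directly from the two empirical distribution functions. The following is a minimal sketch using only the Python standard library; the function name `ks_two_sample_statistic` is hypothetical and not taken from the paper. It relies on the fact that the difference of two right-continuous step functions changes only at sample points, so the supremum is attained at one of them.

```python
from bisect import bisect_right

def ks_two_sample_statistic(x, y):
    """Return D_{n,m} = sup_x |F_n(x) - G_m(x)| for two samples without ties."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # F_n - G_m is a step function that jumps only at sample points,
    # so it suffices to evaluate it at every observation.
    for t in xs + ys:
        fn = bisect_right(xs, t) / n  # empirical distribution function of x at t
        gm = bisect_right(ys, t) / m  # empirical distribution function of y at t
        d = max(d, abs(fn - gm))
    return d

# Completely separated samples give the largest possible value, D = 1.
d_separated = ks_two_sample_statistic([0.1, 0.2], [0.3, 0.4])
# Interleaved samples give a smaller value.
d_interleaved = ks_two_sample_statistic([0.1, 0.3], [0.2, 0.4])
```

The first call illustrates the event $D_{n,m} = 1$, which plays a central role in the theorems below.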
Firstly, we should realize that the statistic $D_{n,m}$ of the two-sample Kolmogorov-Smirnov test has a discrete distribution. Therefore the p-values of this test are discrete as well. For example, consider $n = m = 50$. Then the test statistic $D_{n,m}$ can take just 50 different values $1/n, 2/n, \dots, 1$. For $D_{n,m} = 0.26$ the p-value is equal to 0.0678, and for the next value $D_{n,m} = 0.28$ the p-value is equal to 0.0392. Testing at level $\alpha = 0.05$ can therefore be a little confusing, because the power of the test is the same for every $\alpha \in [0.0392, 0.0678)$. There exists a distribution $G$ such that the power of the Kolmogorov-Smirnov test at level $\alpha = 0.05$ is equal to 0.045. Such a distribution does not meet the requirements of the definition of unbiasedness for $\alpha = 0.05$, although the power of the test is higher than its exact level, which equals 0.0392. To preserve the idea of unbiasedness for tests with a discrete test statistic, we should consider only the discrete values of the significance level $\alpha$ or use randomized versions of these tests.
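The two p-values quoted above can be reproduced with a short sketch. It assumes the classical closed form for equal sample sizes (usually attributed to Gnedenko and Korolyuk), $P(D_{n,n} \ge h/n) = 2 \sum_{j \ge 1} (-1)^{j+1} \binom{2n}{n - jh} / \binom{2n}{n}$; this is not necessarily the formula given in the reference cited in the text, and the function name is hypothetical.

```python
from math import comb

def ks_pvalue_equal_sizes(n, h):
    """P(D_{n,n} >= h/n) under H: F = G, for equal sample sizes and h = 1, ..., n.

    Assumption: the classical reflection-principle formula for n = m.
    """
    total = 0
    for j in range(1, n // h + 1):
        total += (-1) ** (j + 1) * comb(2 * n, n - j * h)
    return 2 * total / comb(2 * n, n)

p_026 = ks_pvalue_equal_sizes(50, 13)  # D = 13/50 = 0.26
p_028 = ks_pvalue_equal_sizes(50, 14)  # D = 14/50 = 0.28
```

Any nominal level between `p_028` and `p_026` leads to the same rejection region, which is the discreteness issue described in the paragraph above.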

It should be kept in mind that the Kolmogorov-Smirnov test is invariant under monotone transformations of the samples. If we transform both samples by the same monotone transformation into samples with distribution functions $F'$ and $G'$, respectively, then $\sup_x |F_n(x) - G_m(x)| = \sup_x |F'_n(x) - G'_m(x)|$. Therefore, without loss of generality, we assume that $F$ is the distribution function of the uniform distribution given by

$$F(x) = \begin{cases} 0 & \text{if } x \le 0, \\ x & \text{if } 0 < x < 1, \\ 1 & \text{if } x \ge 1. \end{cases} \tag{1}$$
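The invariance can be seen from the fact that the statistic depends on the data only through the order of the pooled observations. The sketch below is illustrative only; the helper names and the example data are hypothetical. It applies the increasing map `exp` to both samples and checks that the sequence of sample labels, and hence the statistic, does not change.

```python
from math import exp

def pooled_labels(x, y):
    """Labels ('x' or 'y') of the pooled observations in increasing order."""
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    return "".join(label for _, label in pooled)

def statistic_from_labels(labels):
    """Two-sample Kolmogorov-Smirnov statistic computed from the label sequence alone."""
    n, m = labels.count("x"), labels.count("y")
    i = j = 0
    d = 0.0
    for label in labels:
        if label == "x":
            i += 1
        else:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

x = [0.12, 0.55, 0.80]
y = [0.30, 0.41]
before = pooled_labels(x, y)
# exp is strictly increasing, so the ordering of the pooled sample is preserved.
after = pooled_labels([exp(v) for v in x], [exp(v) for v in y])
d_before = statistic_from_labels(before)
d_after = statistic_from_labels(after)
```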




Gordon and Klebanov (2010) proved that for $n = m$ there exists $\alpha \in (0, 1)$ such that the two-sample Kolmogorov-Smirnov test is unbiased at level $\alpha$ against the two-sided alternative $F \neq G$. If we consider only the one-sided alternatives $A_1: F \le G$ or $A_2: F \ge G$, we can extend this finding to $n \neq m$.

Theorem 2.1. Let $x_1, \dots, x_n$ and $y_1, \dots, y_m$ be independent samples from distributions $F$ and $G$. Then for arbitrary $n, m \in \mathbb{N}$ there exists $\alpha \in (0, 1)$ such that the two-sample Kolmogorov-Smirnov test of the hypothesis $H: F = G$ against the one-sided alternative $A_1: F \le G$ or $A_2: F \ge G$ is unbiased at level $\alpha$.

Proof. Without loss of generality, we assume that the first sample $x_1, \dots, x_n$ is from the uniform distribution. Firstly, we consider only the alternative $A_1: F \le G$. For this alternative, the Kolmogorov-Smirnov statistic is given by $D^*_{n,m} = \inf_{x \in (0,1)} (F_n(x) - G_m(x))$, where $F_n$ and $G_m$ are the empirical distribution functions of the two samples. The hypothesis $H$ is rejected for small values of $D^*_{n,m}$. Consider $\alpha$ so small that we reject the hypothesis $H$ only if $D^*_{n,m}$ equals minus one. This occurs if and only if the samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ satisfy

$$\max(y_1, \dots, y_m) < \min(x_1, \dots, x_n). \tag{2}$$

The probability of this event is given by

$$n \int_0^1 (1 - x)^{n-1} G^m(x)\,dx. \tag{3}$$

Moreover, $G(x)$ is monotone and $G(x) \ge x$ because we consider the alternative $A_1: F \le G$. Therefore the integrand $(1 - x)^{n-1} G^m(x)$ of integral (3) attains its minimum for $G(x) = x$. This integral represents the probability of rejecting the hypothesis at level $\alpha$ if the alternative $G$ is true, and it is minimized for $F(x) = x = G(x)$. Hence, the Kolmogorov-Smirnov test is unbiased at level $\alpha$.

The proof for the alternative $A_2: F \ge G$ is similar. We take $\alpha$ so small that we reject the hypothesis if and only if $D_{n,m} = 1$. Inequality (2) changes to

$$\max(x_1, \dots, x_n) < \min(y_1, \dots, y_m)$$

and the probability of this event is then given by

$$n \int_0^1 x^{n-1} (1 - G(x))^m\,dx. \tag{4}$$

For the alternative $A_2$ we have $G(x) \le x$, and hence integral (4) is minimized for $G(x) = x$. This proves the theorem. □
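The argument of the proof can be checked numerically for one particular alternative. The sketch below approximates $n \int_0^1 (1-x)^{n-1} G^m(x)\,dx$ with a midpoint rule and compares $G(x) = x$ with the alternative $G(x) = \sqrt{x}$, which satisfies $G(x) \ge x$ on $(0, 1)$. The sample sizes, the alternative, and the function name are illustrative assumptions, not taken from the paper. For $G(x) = x$ the integral equals $n\,B(m+1, n) = 1/\binom{n+m}{n}$.

```python
from math import comb, sqrt

def one_sided_rejection_prob(G, n, m, steps=20000):
    """Midpoint-rule approximation of n * integral_0^1 (1-x)^(n-1) * G(x)^m dx."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += (1 - x) ** (n - 1) * G(x) ** m
    return n * h * total

n, m = 5, 3                                            # illustrative sample sizes
p_null = one_sided_rejection_prob(lambda x: x, n, m)   # G = F (hypothesis)
p_alt = one_sided_rejection_prob(sqrt, n, m)           # G(x) = sqrt(x) >= x
exact_null = 1 / comb(n + m, n)                        # closed form for G(x) = x
```

In this example the rejection probability under the alternative exceeds the level, in line with the theorem.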

This theorem does not mean that the two-sample Kolmogorov-Smirnov test is unbiased against one-sided alternatives. It only says that there exists a small level $\alpha$ for which the test is unbiased. In the following theorem we show that for $n \neq m$ the Kolmogorov-Smirnov test is not unbiased against the two-sided alternative.

Theorem 2.2. Let $x_1, \dots, x_n$ be i.i.d. from the uniform distribution with distribution function $F$ and let $y_1, \dots, y_m$ be i.i.d. from a distribution with distribution function $G$. If $n \neq m$ then there exists $\alpha \in (0, 1)$ such that the two-sample Kolmogorov-Smirnov test of the hypothesis $H: F = G$ is biased against the alternative with the distribution function

$$G(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-1}{m-1}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-1}{m-1}}}. \tag{5}$$

Proof. Consider $\alpha$ so small that we reject the hypothesis if and only if $D_{n,m} = \sup_x |F_n(x) - G_m(x)|$ is equal to one. That is, the samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ have to satisfy

$$\max(y_1, \dots, y_m) < \min(x_1, \dots, x_n) \quad \text{or} \quad \max(x_1, \dots, x_n) < \min(y_1, \dots, y_m). \tag{6}$$

The probability of this event is given by

$$n \int_0^1 \left( (1 - x)^{n-1} G^m(x) + x^{n-1} (1 - G(x))^m \right) dx.$$

Substitute $y$ for $G(x)$ and set the derivative of the function $(1 - x)^{n-1} y^m + x^{n-1} (1 - y)^m$ with respect to $y$ equal to zero. This leads to the equation

$$\left(\frac{y}{1-y}\right)^{m-1} = \left(\frac{x}{1-x}\right)^{n-1}.$$

Therefore the probability of event (6) is not minimized for $F(x) = G(x) = x$ but for

$$G(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-1}{m-1}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-1}{m-1}}}.$$

□



Some examples of the distribution function given by (5) are shown in Figure 1. Although we found that the two-sample Kolmogorov-Smirnov test is biased against alternative (5), this holds only for very small $\alpha$. Let us denote this smallest level $\alpha$ by $\alpha_1$. Then $\alpha_1$ can be computed directly as

$$\alpha_1 = n \int_0^1 \left( (1 - x)^{n-1} x^m + x^{n-1} (1 - x)^m \right) dx = 2nm \frac{\Gamma(n)\Gamma(m)}{\Gamma(n + m + 1)}. \tag{7}$$
Figure 1: Plot of the distribution function $G$ given by (5) for $n = 50$ and $m = 20, 55, 100$ (panels: n = 50, m = 20; n = 50, m = 55; n = 50, m = 100).



For example, if $n = 10$ and $m = 11$ then $\alpha_1$ is equal to $5.67 \times 10^{-6}$.
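The level $\alpha_1$ and the bias against alternative (5) can be checked numerically. The sketch below uses the identity $2nm\,\Gamma(n)\Gamma(m)/\Gamma(n+m+1) = 2/\binom{n+m}{n}$ for integer $n, m$ and a midpoint rule for the rejection probability; the function names and the number of integration steps are illustrative choices, not part of the paper.

```python
from math import comb

def alpha1(n, m):
    """Level (7): 2*n*m*Gamma(n)*Gamma(m)/Gamma(n+m+1), which equals 2 / C(n+m, n)."""
    return 2 / comb(n + m, n)

def G5(x, n, m):
    """Distribution function (5) evaluated at x in (0, 1)."""
    r = (x / (1 - x)) ** ((n - 1) / (m - 1))
    return r / (1 + r)

def two_sided_rejection_prob(G, n, m, steps=20000):
    """Midpoint-rule approximation of the probability of event (6) under G."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        g = G(x)
        total += (1 - x) ** (n - 1) * g ** m + x ** (n - 1) * (1 - g) ** m
    return n * h * total

n, m = 10, 11
level = alpha1(n, m)                                            # about 5.67e-6
power = two_sided_rejection_prob(lambda x: G5(x, n, m), n, m)   # under alternative (5)
null_check = two_sided_rejection_prob(lambda x: x, n, m)        # should reproduce the level
```

Here `power` comes out below `level`, which is what biasedness at level $\alpha_1$ against alternative (5) means.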

All previous results concern the value $D_{n,m} = 1$ of the Kolmogorov-Smirnov statistic. Let us now consider the second highest value of this statistic. For $n > m$ it is equal to $1 - 1/n$, and for $n < m$ it is equal to $1 - 1/m$. We denote by $\alpha_2$ the significance level $\alpha$ such that the two-sample Kolmogorov-Smirnov test rejects if and only if $D_{n,m} \ge \max(1 - 1/n, 1 - 1/m)$.

Firstly, assume that $n > m \ge 2$ and consider $D_{n,m} = 1 - 1/n$. This can occur if and only if the samples satisfy $x_{(1)} < \dots < x_{(n-1)} < y_{(1)} < x_{(n)}$ or $x_{(1)} < y_{(m)} < x_{(2)} < \dots < x_{(n)}$. Together with the case $D_{n,m} = 1$ ($x_{(n)} < y_{(1)}$ or $y_{(m)} < x_{(1)}$), we have that $D_{n,m}$ is greater than or equal to $1 - 1/n$ if and only if $x_{(n-1)} < y_{(1)}$ or $y_{(m)} < x_{(2)}$. This leads us to the probability of rejecting the hypothesis at level $\alpha_2$:

$$P(D_{n,m} \ge 1 - 1/n) = P(y_{(1)} > x_{(n-1)}) + P(y_{(m)} < x_{(2)}) = n(n-1) \int_0^1 \left( x^{n-2} (1 - x)(1 - G(x))^m + x (1 - x)^{n-2} G^m(x) \right) dx. \tag{8}$$

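As in the proof of the previous theorem, let $G(x) = y$ and set the derivative of the integrand of integral (8) with respect to $y$ equal to zero. This leads us to the equation

$$\left(\frac{y}{1-y}\right)^{m-1} = \left(\frac{x}{1-x}\right)^{n-3}.$$

The solution $y$ as a function of $x$ is given by

$$y = G(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-3}{m-1}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-3}{m-1}}}. \tag{9}$$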

Now assume that $2 \le n < m$ and consider $D_{n,m} = 1 - 1/m$. This holds if and only if $y_{(1)} < \dots < y_{(m-1)} < x_{(1)} < y_{(m)}$ or $y_{(1)} < x_{(n)} < y_{(2)} < \dots < y_{(m)}$. Therefore the probability of the event $D_{n,m} \ge 1 - 1/m$ is equal to

$$\begin{aligned} P(D_{n,m} \ge 1 - 1/m) &= P(D_{n,m} = 1 - 1/m) + P(D_{n,m} = 1) \\ &= nm \int_0^1 \left( (1 - x)^{n-1} G^{m-1}(x)(1 - G(x)) + x^{n-1} (1 - G(x))^{m-1} G(x) \right) dx \\ &\quad + n \int_0^1 \left( (1 - x)^{n-1} G^m(x) + x^{n-1} (1 - G(x))^m \right) dx. \end{aligned} \tag{10}$$



As before, let $G(x) = y$ and set the derivative of the integrand of integral (10) with respect to $y$ equal to zero. This leads us to the equation

$$\left(\frac{y}{1-y}\right)^{m-3} = \left(\frac{x}{1-x}\right)^{n-1}.$$

Therefore the distribution function of the most biased distribution in this case is given by

$$y = G(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-1}{m-3}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-1}{m-3}}}. \tag{11}$$

Remark 2.3. If $n = 3$ and $m = 2$, or $n = 2$ and $m = 3$, then the most biased distribution is the discrete distribution given by the probabilities $P(y = 0) = P(y = 1) = 1/2$ or $P(y = 1/2) = 1$, respectively.

Consider $G(x) = x$. Then the level $\alpha_2$ is given (according to (8) and (10)) by

$$\alpha_2 = 2nmk \frac{\Gamma(n)\Gamma(m)}{\Gamma(n + m + 1)} = k\alpha_1, \tag{12}$$

where $k = \min(n + 1, m + 1)$. The distribution functions (9) and (11) are similar to the S-curves in Figure 1. Although these distribution functions are equal neither to each other nor to (5), some interesting results can be found. If $|n - m| = 2$ then (9) and (11) reduce to $G(x) = x$. This means that the distribution which minimizes (8) and (10) is the uniform distribution. This leads us to the following theorem.

Theorem 2.4. Let $\alpha_{n,m}$ be given by (12). If $n = m + 2$ or $n = m - 2$ then the two-sample Kolmogorov-Smirnov test is unbiased at level $\alpha_{n,m}$. Moreover, if $n \neq m$ and $|n - m| \neq 2$ then the Kolmogorov-Smirnov test is biased at level $\alpha_{n,m}$.

Proof. Because $\alpha_{n,m} = \alpha_2$, the most biased distribution functions are given by (9) and (11). For $|n - m| = 2$ they reduce to $G(x) = x = F(x)$. This means that the uniform distribution minimizes the probability of rejecting the hypothesis $F = G$ against the alternative $F \neq G$ at level $\alpha_2$ if and only if $|n - m| = 2$. □
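The condition $|n - m| = 2$ can be read off the exponent of $x/(1-x)$ in the most biased distribution: the distribution is uniform exactly when that exponent equals one. The sketch below encodes the exponents $(n-3)/(m-1)$ for $n > m$ and $(n-1)/(m-3)$ for $n < m$, together with $\alpha_2 = k\alpha_1$, in exact rational arithmetic. The function names are hypothetical, and the sample sizes used are illustrative.

```python
from fractions import Fraction
from math import comb

def most_biased_exponent(n, m):
    """Exponent c such that the most biased G at level alpha_2 is r/(1+r), r = (x/(1-x))**c."""
    if n > m >= 2:
        return Fraction(n - 3, m - 1)   # case n > m, equation (9)
    if m > n >= 2:
        if m == 3:
            # n = 2, m = 3: the exponent is infinite (degenerate case of Remark 2.3)
            raise ValueError("degenerate case n = 2, m = 3")
        return Fraction(n - 1, m - 3)   # case n < m, equation (11)
    raise ValueError("requires n != m and both sample sizes >= 2")

def alpha2(n, m):
    """Level (12): k * alpha_1 with k = min(n+1, m+1) and alpha_1 = 2 / C(n+m, n)."""
    return min(n + 1, m + 1) * Fraction(2, comb(n + m, n))

uniform_cases = [most_biased_exponent(6, 4), most_biased_exponent(4, 6)]  # |n - m| = 2
biased_case = most_biased_exponent(5, 4)                                   # |n - m| = 1
level_4_2 = alpha2(4, 2)
```

For $n = 4$, $m = 2$ the value of `alpha2` can also be confirmed by hand: 6 of the 15 equally likely orderings of the pooled sample give $D_{n,m} \ge 3/4$.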

Remark 2.5. If $|n - m| = 1$ then the Kolmogorov-Smirnov test is not biased against the distribution functions (9) and (11) at level $\alpha_1$.

Let us denote by $\mathcal{A}_\alpha$ the set of distributions for which the Kolmogorov-Smirnov test is biased at level $\alpha$, that is,

$$\mathcal{A}_\alpha = \{G : P(\text{reject } H \text{ at level } \alpha \mid \text{alternative } G \text{ is true}) < \alpha\}.$$

For different levels $\alpha < \alpha^*$, one would expect some subset relation between $\mathcal{A}_\alpha$ and $\mathcal{A}_{\alpha^*}$. But this is not true in general. According to Theorem 2.4 there exists $G_\alpha$ such that $G_\alpha \in \mathcal{A}_\alpha$ and $G_\alpha \notin \mathcal{A}_{\alpha^*}$. On the other hand, from Remark 2.5 we have that there exists $G^*$ such that $G^* \notin \mathcal{A}_\alpha$ and $G^* \in \mathcal{A}_{\alpha^*}$. Therefore, in general, $\mathcal{A}_\alpha$ is not a subset of $\mathcal{A}_{\alpha^*}$, and vice versa.
The previous result can be quite simply generalized to $\alpha_3$ (the third smallest $\alpha$) in the case $n > 2m$ or $2n < m$. Adding the probability of the event $D_{n,m} = 1 - 2/n$ or $D_{n,m} = 1 - 2/m$ to (8) or (10), respectively, leads us to the most biased distributions at level $\alpha_3$, given by

$$G_3(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-5}{m-1}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-5}{m-1}}} \quad \text{if } n > 2m \tag{13}$$

or

$$G_3(x) = \frac{\left(\frac{x}{1-x}\right)^{\frac{n-1}{m-5}}}{1 + \left(\frac{x}{1-x}\right)^{\frac{n-1}{m-5}}} \quad \text{if } m > 2n. \tag{14}$$

In this case, $\alpha_3$ is given by

$$\alpha_3 = 2 k_2 nm \frac{\Gamma(n)\Gamma(m)}{\Gamma(n + m + 1)} = k_2 \alpha_1,$$

where $k_2 = \min((m+2)(m+1), (n+2)(n+1))/2$. If $n = m + 4$ or $m = n + 4$ then $G_3(x) = x$. Together with the condition $n > 2m$ or $m > 2n$, we have that for $n = 6, m = 2$ or $n = 2, m = 6$ the two-sample Kolmogorov-Smirnov test is unbiased at level $\alpha_3 = 3/7$, and for $n = 7, m = 3$ or $n = 3, m = 7$ the two-sample Kolmogorov-Smirnov test is unbiased at level $\alpha_3 = 1/6$.
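The two levels $3/7$ and $1/6$ quoted for $\alpha_3$ can be verified with exact rational arithmetic. The sketch below assumes $k_2 = \min((m+1)(m+2), (n+1)(n+2))/2$ and the exponents $(n-5)/(m-1)$ for $n > 2m$ and $(n-1)/(m-5)$ for $m > 2n$; both follow from repeating the derivation of (9) and (11) one step further and should be read as a reconstruction, not as the authors' own code. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def alpha3(n, m):
    """alpha_3 = k_2 * alpha_1, assuming k_2 = min((m+1)(m+2), (n+1)(n+2)) / 2."""
    if not (n > 2 * m or m > 2 * n):
        raise ValueError("requires n > 2m or m > 2n")
    k2 = Fraction(min((m + 1) * (m + 2), (n + 1) * (n + 2)), 2)
    return k2 * Fraction(2, comb(n + m, n))

def g3_exponent(n, m):
    """Assumed exponent of x/(1-x) in the most biased distribution at level alpha_3."""
    if n > 2 * m and m >= 2:
        return Fraction(n - 5, m - 1)
    if m > 2 * n and n >= 2:
        if m == 5:
            raise ValueError("degenerate case n = 2, m = 5")
        return Fraction(n - 1, m - 5)
    raise ValueError("requires n > 2m or m > 2n with both sample sizes >= 2")

levels = {(n, m): alpha3(n, m) for (n, m) in [(6, 2), (2, 6), (7, 3), (3, 7)]}
exponents = {(n, m): g3_exponent(n, m) for (n, m) in levels}
```

An exponent equal to one means that the minimizing distribution is uniform, so the test is unbiased at that level for the listed sample sizes.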



The values of $\alpha$ considered so far are too small when we have some tens of observations in each sample. Therefore we perform the following simulation to see whether the two-sample Kolmogorov-Smirnov test is biased against the distribution (5) at level $\alpha \approx 0.05$. We set the number of observations in the first sample to $n = 10, 20, 50, 100$ and the number of observations in the second sample to $m = 11, 15, 21, 51, 101$. The first sample is drawn from the uniform distribution, and for the second sample we consider two distributions: the uniform distribution and the distribution with distribution function $G$ given by (5). We perform 10000 repetitions and compute the difference between the estimated power, when the second sample is from the alternative distribution, and the estimated level $\alpha$, when the second sample is from the uniform distribution. The results of this simulation are given in Table 1. We can see that for all considered $n$ and $m$ the estimated difference is greater than 0. This suggests that the two-sample Kolmogorov-Smirnov test is not biased against alternative (5) at level $\alpha = 0.05$.



Table 1: Difference between the estimate of power for the alternative $G$ given by (5) and the estimate of level $\alpha$ of the two-sample Kolmogorov-Smirnov test ($\alpha$ = 5%).

          m = 11   m = 15   m = 21   m = 51   m = 101
n = 10    0.0034   0.0144   0.0320   0.4153   0.7290
n = 20    0.0291   0.0087   0.0016   0.2784   0.9170
n = 50    0.4071   0.3403   0.2715   0.0001   0.5291
n = 100   0.9070   0.9189   0.9190   0.4557   0.0001
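The simulation described above can be sketched as follows. This is not the authors' code. It makes several assumptions: the critical value is taken from the exact null distribution obtained by counting lattice paths, the same uniform first sample is reused for the null and the alternative second sample, and draws from (5) are generated by inverse transform. All function names are hypothetical. The statistic is kept as the integer $nm \cdot D_{n,m}$ to avoid floating-point ties.

```python
import random
from math import comb

def ks_int_statistic(x, y):
    """Return T = n*m*D_{n,m} as an exact integer."""
    n, m = len(x), len(y)
    merged = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    i = j = best = 0
    for _, label in merged:
        if label == 0:
            i += 1
        else:
            j += 1
        best = max(best, abs(i * m - j * n))
    return best

def exact_tail(n, m, t):
    """P(n*m*D_{n,m} >= t) under H: F = G, by counting lattice paths."""
    # row[j] = number of paths from (0, 0) to (i, j) that keep |i*m - j*n| < t
    row = [0] * (m + 1)
    for i in range(n + 1):
        for j in range(m + 1):
            if abs(i * m - j * n) >= t:
                row[j] = 0
            elif i == 0 and j == 0:
                row[j] = 1
            else:
                row[j] = (row[j] if i > 0 else 0) + (row[j - 1] if j > 0 else 0)
    return 1 - row[m] / comb(n + m, n)

def critical_value(n, m, alpha):
    """Smallest integer t with P(n*m*D_{n,m} >= t) <= alpha (binary search)."""
    lo, hi = 1, n * m + 1  # the tail probability at n*m + 1 is zero
    while lo < hi:
        mid = (lo + hi) // 2
        if exact_tail(n, m, mid) <= alpha:
            hi = mid
        else:
            lo = mid + 1
    return lo

def sample_G(n, m, rng):
    """One draw from alternative (5) by inverting G."""
    c = (n - 1) / (m - 1)
    u = rng.random()
    s = (u / (1 - u)) ** (1 / c)
    return s / (1 + s)

def power_minus_level(n, m, alpha=0.05, reps=10000, seed=0):
    """Estimated power under (5) minus estimated level under the uniform distribution."""
    rng = random.Random(seed)
    t = critical_value(n, m, alpha)
    rej_alt = rej_null = 0
    for _ in range(reps):
        x = [rng.random() for _ in range(n)]
        y_alt = [sample_G(n, m, rng) for _ in range(m)]
        y_null = [rng.random() for _ in range(m)]
        rej_alt += ks_int_statistic(x, y_alt) >= t
        rej_null += ks_int_statistic(x, y_null) >= t
    return rej_alt / reps - rej_null / reps

if __name__ == "__main__":
    print(power_minus_level(10, 11, reps=2000))
```

Because the estimates are random, individual values will differ from those in Table 1, and differences close to zero should be interpreted with the Monte Carlo error in mind.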



3 Conclusion 

In this paper we looked at the biasedness and unbiasedness of the two-sample Kolmogorov-Smirnov test. In the case of different sample sizes this test is not unbiased. However, we found that this does not hold for all $\alpha \in (0, 1)$. There exist special combinations of the numbers of observations in each sample and the significance level $\alpha$ at which the test is unbiased (see e.g. Theorem 2.4). Moreover, we found the most biased distribution for some values of $\alpha$. Although we considered only small values of $\alpha$, such levels are appropriate for small sample sizes or for data such as gene expressions. We did not consider all levels of $\alpha$. However, we point out that this test can be unbiased for large samples and $\alpha$ around 0.05. More research is needed to find the exact relation between the numbers of observations and the level $\alpha$ at which this test is unbiased.